Melbourne Housing Market - Linear Regression (Part 2)

Summary

In this notebook, I add a few analyses to the previous notebook, which applied linear regression to model price from the various variables. I have included 3 feature selection approaches - Correlation Statistics, Mutual Information Statistics and K-fold Cross Validation - to determine the number of variables that best improves the model.

The main components of this notebook can be split into:

  1. Continuation from the previous notebook
  2. Feature Selection using Correlation Statistics
  3. Feature Selection using Mutual Information Statistics
  4. Model Evaluation using MAE, MSE, RMSE and R^2

*This notebook is copied and adapted from https://www.kaggle.com/anthonypino/price-analysis-and-linear-regression.

From Part 1

1. Data Cleaning

  1. Convert arguments in Date column to datetime
  2. Filter out data that are not housing types
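A minimal sketch of these two cleaning steps, assuming pandas, a `Date` column holding day/month/year strings, and a `Type` column in which `"h"` marks houses. The rows below are hypothetical stand-ins, not values from the dataset.

```python
import pandas as pd

# Hypothetical rows in the shape of the housing data (not real records).
df = pd.DataFrame({
    "Date": ["3/12/2016", "4/02/2017", "8/07/2017"],
    "Type": ["h", "u", "h"],
    "Price": [1_480_000, 735_000, 1_035_000],
})

# Parse the day/month/year strings into proper datetime values.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Keep houses only (assumed coding: "h" = house, "u" = unit).
houses = df[df["Type"] == "h"].copy()
print(houses)
```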

2. Data Exploration using Visualisations

  1. Histogram plot for each variable
  2. Pair plots
  3. Observe average price change per quarter over the years
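The quarterly average in the last step could be computed along these lines. This is only a sketch on made-up sales, assuming `Date` is already a datetime column and `Price` is numeric.

```python
import pandas as pd

# Hypothetical sales used purely for illustration.
sales = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2016-11-12", "2016-12-03", "2017-02-04", "2017-03-18", "2017-05-27"]
    ),
    "Price": [900_000, 1_100_000, 1_000_000, 1_200_000, 950_000],
})

# Group sales by calendar quarter and average the price in each quarter.
quarterly_mean = sales.groupby(sales["Date"].dt.to_period("Q"))["Price"].mean()

# Relative change in the average price from one quarter to the next.
quarterly_change = quarterly_mean.pct_change()
print(quarterly_mean)
print(quarterly_change)
```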

Analysis:

  1. Housing prices in Melbourne appear to begin cooling off sometime between April and July 2017.
  2. Based on the correlation matrix, the three features most strongly correlated with price are the number of bathrooms, the number of bedrooms and the distance (in kilometres) from the CBD. I plotted boxplots to visualise how price varies with the number of bedrooms and bathrooms. The boxplot for the number of bedrooms indicates that there is quite a lot of variability. For distance, I used a regression plot to see how price varies. The plot shows a negative relationship between the two, which is logical since housing near the CBD is usually priced higher than housing in the outer regions.

3. Linear Regression Model with all Features

In this part, I evaluate the linear regression model using all the available features. The data is split into training and test sets with a 2:1 ratio. The coefficients of the predictor variables are then ranked, with longitude, number of bathrooms and the vendor bid method emerging as the top 3 features in the model.
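The split-fit-rank workflow can be sketched as follows, assuming scikit-learn. The data is synthetic, and the column names and coefficients are illustrative assumptions rather than values from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Rooms": rng.integers(1, 6, n),
    "Bathroom": rng.integers(1, 4, n),
    "Distance": rng.uniform(1, 40, n),
})
y = (200_000 * X["Bathroom"] + 100_000 * X["Rooms"]
     - 10_000 * X["Distance"] + rng.normal(0, 50_000, n))

# test_size=1/3 gives the 2:1 train/test ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Rank predictors by the absolute size of their fitted coefficients.
coefs = pd.Series(model.coef_, index=X.columns)
coef_ranking = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coef_ranking)
print("Test R^2:", model.score(X_test, y_test))
```

Raw coefficients depend on the units of each feature, so a ranking like this is easier to interpret when the features are on comparable scales.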

4. Visualising Regression Models

Distribution plot: difference between actual price and predicted price

(End of Part 1)

Part 2 - Feature Selection

Mutual Information Statistics

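This method uses mutual information, which measures how much knowing the value of one variable reduces the uncertainty about another, to determine which variables are the most relevant. Unlike Pearson's correlation, it can also capture non-linear relationships.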
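As a rough illustration of mutual-information-based selection, here is a sketch assuming scikit-learn's `SelectKBest` with `mutual_info_regression` as the scoring function. The data is synthetic, and `k=2` is an arbitrary choice for the example.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic data: the target depends on column 0 linearly and on
# column 1 non-linearly; columns 2 and 3 are pure noise.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
y = X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

# mutual_info_regression uses a randomised estimator, so fix its seed.
score_func = partial(mutual_info_regression, random_state=0)

# Keep the k features with the highest estimated mutual information.
selector = SelectKBest(score_func=score_func, k=2).fit(X, y)
mi_scores = selector.scores_
selected = selector.get_support(indices=True)
print("MI scores:", mi_scores)
print("Selected columns:", selected)
```

The non-linear column is included on purpose: it is the kind of dependence a purely linear correlation score would tend to miss.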

Correlation Statistics

This method uses the correlation between each feature and the target (the most common measure being Pearson's correlation) to determine which variables are the most relevant.
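A correlation-based selection might look like the sketch below. It assumes scikit-learn's `f_regression` as the scoring function, which ranks features in the same order as the absolute Pearson correlation with the target. The data and `k=2` are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only columns 0 and 3 drive the target.
rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 5))
y = 4 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=1.0, size=n)

# Score each feature by its univariate linear relationship with y
# and keep the k highest-scoring ones.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = selector.get_support(indices=True)

# Pearson correlation of each feature with the target, for comparison.
pearson_r = np.array(
    [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
)
print("Selected columns:", selected)
print("Pearson r:", pearson_r.round(3))
```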

Visualising Regression Models

Model Evaluation

After applying the two feature selection techniques and comparing the resulting models, the metrics indicate that mutual information statistics allow us to achieve a more accurate model: a higher R^2 and lower error metrics (MAE, MSE and RMSE).
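For reference, the four metrics can be computed as in this sketch, assuming scikit-learn's metric functions. The helper name `regression_report` and the toy numbers are hypothetical and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "R2": r2_score(y_true, y_pred),
    }


# Toy values chosen so the metrics are easy to check by hand.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])
report = regression_report(y_true, y_pred)
print(report)
```

Calling the same helper on the test-set predictions of each model gives a like-for-like comparison: lower MAE, MSE and RMSE and a higher R^2 indicate the better fit.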